Authors: Amir Yunus, Michelle Ng, Ragunathan Theleeban
GitHub: https://github.com/AmirYunus/GA_DSI_Project_4
West Nile virus is most commonly spread to humans through infected mosquitos. Around 20% of people who become infected with the virus develop symptoms ranging from a persistent fever to serious neurological illnesses that can result in death.
In 2002, the first human cases of West Nile virus were reported in Chicago. By 2004, the City of Chicago and the Chicago Department of Public Health (CDPH) had established a comprehensive surveillance and control program that is still in effect today.
Every week from late spring through the fall, mosquitos in traps across the city are tested for the virus. The results of these tests influence when and where the city will spray airborne pesticides to control adult mosquito populations.
Given weather, location, testing, and spraying data, we are tasked with predicting when and where different species of mosquitos will test positive for West Nile virus. A more accurate method of predicting outbreaks of West Nile virus in mosquitos will help the City of Chicago and CDPH more efficiently and effectively allocate resources towards preventing transmission of this potentially deadly virus.
In this report, we will consider 4 datasets, namely, df_train, df_test, df_spray and df_weather. We will also import df_sample as the sample submission for Kaggle.
Once the datasets are imported, we will explore each feature. Feature engineering comes next as we transform the date and weather features. Categorical features are also transformed to dummy variables.
Finally, we will train our models using GridSearchCV; the best model will be used for our Kaggle submission.
import warnings
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from datetime import datetime
from sklearn import ensemble, preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from IPython.display import display, Markdown
# Prevent warnings from distracting the reader
warnings.filterwarnings('ignore')
# Colour scheme and style selected
theme = ['#1F306E', '#553772', '#8F3B76', '#C7417B', '#F5487F']
sns.palplot(sns.color_palette(theme))  # preview the palette; palplot returns None
plt.style.use('seaborn')
sns.set(style="white", color_codes=True)
sns.set_palette(theme)  # pass the colour list itself, not palplot's return value
# Forces Matplotlib to use high-quality images
ip = get_ipython()
ibe = ip.configurables[-1]
ibe.figure_formats = {'pdf', 'png'}
# Functions to extract month and day from dataset
def create_month(x):
"""
function made to extract month out of date
Parameters
----------
x: str variable containing date in YYYY-MM-DD format
Returns
------
str containing month
"""
return x.split('-')[1]
def create_day(x):
"""
function made to extract day out of date
Parameters
----------
x: str variable containing date in YYYY-MM-DD format
Returns
------
str containing day
"""
return x.split('-')[2]
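As an aside, the same month/day extraction can be done vectorised with pandas datetime accessors; a small sketch (the sample dates below are illustrative, not from the dataset):

```python
import pandas as pd

# Illustrative date strings in the same YYYY-MM-DD format as df_train['Date']
dates = pd.Series(['2007-05-29', '2013-08-01'])

# Vectorised equivalent of create_month / create_day
parsed = pd.to_datetime(dates, format='%Y-%m-%d')
months = parsed.dt.strftime('%m')  # zero-padded month, matches split('-')[1]
days = parsed.dt.strftime('%d')    # zero-padded day, matches split('-')[2]
print(months.tolist(), days.tolist())  # → ['05', '08'] ['29', '01']
```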
def pipeline(items, X_train, X_val, y_train, y_val):
"""
function to build a pipeline and run a grid search to determine which
estimator and parameters score highest
Parameters
---------
items: list of pipeline step keys (e.g. ['ss', 'lgr']) selecting which estimators and parameters to use
X_train: training dataset containing the predictor variables
X_val: validation dataset containing the predictor variables
y_train: training dataset containing the target variable
y_val: validation dataset containing the target variable
Returns
------
train_score: ROC AUC score on the training set (GridSearchCV is given scoring='roc_auc')
test_score: ROC AUC score on the validation set
roc_auc: ROC AUC computed from the predicted positive-class probabilities on the validation set
"""
# Map each pipeline key to its estimator ('ss' included so the matching
# entry in param_items below can be selected without a KeyError)
pipe_items = {
'ss': preprocessing.StandardScaler(),
'lgr': LogisticRegression(),
'rfc': RandomForestClassifier(),
'gbc': GradientBoostingClassifier(),
'abc': AdaBoostClassifier(),
'knn': KNeighborsClassifier()
}
# Include at least one param for each pipe item
param_items = {
'ss': {
},
'lgr': {
},
'rfc': {
},
'gbc': {
},
'abc': {
},
'knn': {
}
}
# Create the parameters for GridSearch
params = dict()
for i in items:
for p in param_items[i]:
params[p] = param_items[i][p]
# Create the pipeline
pipe_list = [(i, pipe_items[i]) for i in items]
for p in pipe_list:
display(Markdown(f'<b>{str(p[1]).split("(")[0]}</b>'))
pipe = Pipeline(pipe_list)
# Grid search
gs = GridSearchCV(pipe, param_grid=params, scoring='roc_auc', verbose=1, cv=3, n_jobs=-1)
gs.fit(X_train, y_train)
# Print the results
train_params = gs.best_params_
train_score = gs.score(X_train,y_train)
test_score = gs.score(X_val,y_val)
y_val_hat = gs.predict(X_val)
y_val_proba = gs.predict_proba(X_val)
df_val_proba = pd.DataFrame(y_val_proba)
y_val_proba_1 = df_val_proba[1]  # probability of the positive class (column 1)
roc_auc = roc_auc_score(y_val, y_val_proba_1)
for k in train_params:
print("{}: {}".format(k, train_params[k]))
print("Train score: {} Test score: {}".format(train_score,test_score))
print("ROC AUC score: {}".format(roc_auc))
print("")
return train_score, test_score, roc_auc
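To make the search above concrete, here is a minimal, self-contained sketch of the same Pipeline + GridSearchCV pattern on synthetic data (step names 'ss' and 'lgr' mirror the keys used in the helper; the data here is made up):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real notebook passes the engineered features
rng = np.random.RandomState(42)
X = rng.rand(200, 5)
y = (X[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Same pattern as the helper: named steps, a param grid keyed by step name,
# and ROC AUC as the search's scoring metric
pipe = Pipeline([('ss', StandardScaler()), ('lgr', LogisticRegression())])
params = {'lgr__C': [0.1, 1.0]}
gs = GridSearchCV(pipe, param_grid=params, scoring='roc_auc', cv=3)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.score(X_val, y_val))
```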
def log():
"""
prints the current datetime as a log marker
Parameters
----------
None
Returns
-------
None; prints the current computer time
"""
now = str(datetime.now())
print(f'[LOG {now}]')
def single_barplot(df,x,y,title,xlab,ylab):
"""
plots a bar chart of the sum of y grouped by x
Parameters
----------
df: Dataframe where data comes from
x: column of dataframe of interest
y: column of dataframe of interest
title: title of plot
xlab: label of x axis
ylab: label of y axis
Returns
-------
The plotted bar graph
"""
df[[x, y]].groupby(x).sum().plot.bar(figsize=(20,11))
plt.title(title, fontsize=30)
plt.xlabel(xlab, fontsize =20)
plt.xticks(rotation=90)
plt.ylabel(ylab, fontsize=20)
plt.grid(b = True)
def state_postcodeextract(df):
"""
To add a new 'state' column derived from the address
Parameters
----------
df: DataFrame of concern
Output
------
Returns the first 5 rows of the dataframe
"""
df['state'] = ['IL' if 'IL' in x else 'somewhereelse' for x in df['Address']]
return df.head()
# Function to compute Pearson correlation between given columns and WnvPresent
def pearson(lst, df):
"""
takes a list of dataframe columns and runs Pearson's correlation against the WnvPresent indicator
Parameters
----------
lst: list of dataframe columns
df: Dataframe containing the columns
"""
for a in lst:
(stat, p) = stats.pearsonr(df[a], df.WnvPresent)
print(a, 'r:', stat, 'p:', p)
We will conduct a descriptive analysis of the West Nile virus data. In addition, we will apply the necessary pre-processing steps to train our model. This report is based on the "West Nile Virus Prediction" competition on Kaggle [1].
Let us start by loading all the datasets and investigating their structure and attributes:
log()
# Load dataset
df_train = pd.read_csv('../data/train.csv')
print(f'Training dataset of {df_train.shape[0]} observations and {df_train.shape[1]} features loaded')
df_test = pd.read_csv('../data/test.csv')
print(f'Testing dataset of {df_test.shape[0]} observations and {df_test.shape[1]} features loaded')
df_spray = pd.read_csv('../data/spray.csv')
print(f'Spray dataset of {df_spray.shape[0]} observations and {df_spray.shape[1]} features loaded')
df_weather = pd.read_csv('../data/weather.csv')
print(f'Weather dataset of {df_weather.shape[0]} observations and {df_weather.shape[1]} features loaded')
df_sample = pd.read_csv('../data/sampleSubmission.csv')
print(f'Sample Submission dataset of {df_sample.shape[0]} observations and {df_sample.shape[1]} features loaded')
Overview:
log()
# Check the shape of df_train
print(f'Training dataset has {df_train.shape[0]} observations and {df_train.shape[1]} features')
# Check for missing values
print(f'Training dataset has {df_train.isnull().sum().sum()} NaNs')
# Check for duplicates
print(f'Training dataset has {df_train[df_train.duplicated(list(df_train.columns))].shape[0]} duplicated observations')
log()
# display columns in df_train
df_train.columns
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_train.columns[0]}</b> - {df_train.Date.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_train.Date.value_counts(normalize=True)}\n')
print()
n_nan = df_train.Date.isnull().sum()
if n_nan > 0:
print(f'{df_train.Date} has {n_nan} NaNs')
log()
# plot distribution of months collected
df_train['month'] = [create_month(x) for x in df_train['Date']]
plt.figure(figsize = (16,9))
df_train['month'].value_counts().plot(kind='barh',)
plt.title('Distribution of months training data was collected',{'size': 15})
plt.xlabel('Number of training data collected in a month', {'size': 15})
plt.ylabel('Month of training data collected', {'size': 15})
The data shows that the bulk of the training data was collected between June and September (months 6 to 9). This aligns with the summer months in Chicago.
log()
# plot month against west nile virus
single_barplot(df=df_train,
x='month',
y='WnvPresent',
title='Month Vs Number of West Nile Virus Detected',
xlab='Months for the years (2007, 2009, 2011, 2013)',
ylab='West Nile Virus detections')
log()
f,ax=plt.subplots(1,2,figsize=(18,9))
#Months Vs NumMosquitos
ax[0].set_title('Month of the year Vs Number of Mosquitos found')
ax[0]=sns.lineplot(x='month', y='NumMosquitos',data=df_train,ax=ax[0],palette="RdBu")
ax[0].grid(b = True)
#Months Vs WnvPresent
ax[1].set_title('Month of the year Vs Number of Mosquitos with WNV found')
ax[1]=sns.barplot(x='month', y='WnvPresent',hue='Species',data=df_train,palette="RdBu",ax=ax[1])
ax[1].grid(b = True)
Across the years 2007, 2009, 2011 and 2013, the highest total number of WNV detections was observed in August, and the number of carrier-species mosquitos also peaked in the summer season.
Similarly, across those years the number of mosquitos caught rose steadily from May, peaked in August, and declined afterwards; the carrier species followed the same trend.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_train.columns[1]}</b> - {df_train.Address.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_train.Address.value_counts(normalize=True)}\n')
print()
n_nan = df_train.Address.isnull().sum()
if n_nan > 0:
print(f'{df_train.Address} has {n_nan} NaNs')
log()
# display state
state_postcodeextract(df_train)
log()
df_train['state'].value_counts()
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_train.columns[2]}</b> - {df_train.Species.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_train.Species.value_counts(normalize=True)}\n')
print()
n_nan = df_train.Species.isnull().sum()
if n_nan > 0:
print(f'{df_train.Species} has {n_nan} NaNs')
log()
df_train[['Species','WnvPresent','NumMosquitos']].groupby(['Species', 'WnvPresent']).sum()
log()
# species of west nile
plt.figure(figsize=(16,9))
ax=sns.barplot(x='Species', y='NumMosquitos', hue='WnvPresent',data=df_train,palette="RdBu")
plt.title("Species Vs Number of West Nile Virus Detected", fontsize='large')
L = plt.legend(loc="upper right")
L.get_texts()[0].set_text('West Nile Not Detected')
L.get_texts()[1].set_text('West Nile Detected')
plt.xlabel("Species", fontsize ='large')
plt.ylabel("West Nile Virus detections", fontsize='large')
Culex Pipiens and Culex Restuans have the highest populations in the training set. Amongst the species represented, the West Nile virus was present only in Culex Pipiens and Culex Restuans, which makes them the carriers. What matters for spreading the illness is the number of WNV-carrying mosquitos, not the overall number of mosquitos.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_train.columns[3]}</b> - {df_train.Block.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_train.Block.value_counts(normalize=True)}\n')
print()
n_nan = df_train.Block.isnull().sum()
if n_nan > 0:
print(f'{df_train.Block} has {n_nan} NaNs')
Most of the training data were collected in block 10.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_train.columns[4]}</b> - {df_train.Street.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_train.Street.value_counts(normalize=True)}\n')
print()
n_nan = df_train.Street.isnull().sum()
if n_nan > 0:
print(f'{df_train.Street} has {n_nan} NaNs')
Most of the trap records came from O'Hare Airport, where many traps recorded the maximum entry of 50 mosquitos.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_train.columns[5]}</b> - {df_train.Trap.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_train.Trap.value_counts(normalize=True)}\n')
print()
n_nan = df_train.Trap.isnull().sum()
if n_nan > 0:
print(f'{df_train.Trap} has {n_nan} NaNs')
log()
# display trap T900
df_train[df_train['Trap'] == "T900"]
The most frequently recorded trap is T900, which is placed at O'Hare Airport and collected the most mosquitos.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_train.columns[6]}</b> - {df_train.AddressNumberAndStreet.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_train.AddressNumberAndStreet.value_counts(normalize=True)}\n')
print()
n_nan = df_train.AddressNumberAndStreet.isnull().sum()
if n_nan > 0:
print(f'{df_train.AddressNumberAndStreet} has {n_nan} NaNs')
Again, most of the data was collected at O'Hare Airport.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_train.columns[7]}</b> - {df_train.Latitude.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_train.Latitude.value_counts(normalize=True)}\n')
print()
n_nan = df_train.Latitude.isnull().sum()
if n_nan > 0:
print(f'{df_train.Latitude} has {n_nan} NaNs')
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_train.columns[8]}</b> - {df_train.Longitude.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_train.Longitude.value_counts(normalize=True)}\n')
print()
n_nan = df_train.Longitude.isnull().sum()
if n_nan > 0:
print(f'{df_train.Longitude} has {n_nan} NaNs')
Both longitude and latitude show similar proportions, as their most common values point to O'Hare Airport, which has the highest number of mosquitos.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_train.columns[9]}</b> - {df_train.AddressAccuracy.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_train.AddressAccuracy.value_counts(normalize=True)}\n')
print()
n_nan = df_train.AddressAccuracy.isnull().sum()
if n_nan > 0:
print(f'{df_train.AddressAccuracy} has {n_nan} NaNs')
70% of the addresses have an address accuracy of 8 and 9, which means 30% are not as accurate.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_train.columns[10]}</b> - {df_train.NumMosquitos.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_train.NumMosquitos.value_counts(normalize=True)}\n')
print()
n_nan = df_train.NumMosquitos.isnull().sum()
if n_nan > 0:
print(f'{df_train.NumMosquitos} has {n_nan} NaNs')
log()
df_train[df_train['NumMosquitos'] == 50]['AddressNumberAndStreet'].value_counts().head()
Most traps recorded only 1 or 2 mosquitos; the next most common value, however, is 50. Since 50 is the maximum count a single row can hold, multiple rows were entered whenever a trap caught more than 50 mosquitos. The areas with the most 50-mosquito entries are S Doty Ave and S Stony Island Ave (both near a marsh), and O'Hare Airport.
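Because counts above 50 are split across multiple rows, one reasonable pre-processing step is to re-aggregate rows that share the same trap and date; a sketch with toy rows (column names match df_train, values are made up):

```python
import pandas as pd

# Toy rows mimicking the 50-per-row cap: one trap-date observation of 120
# mosquitos arrives as three rows (50 + 50 + 20)
df = pd.DataFrame({
    'Date': ['2007-08-01'] * 3,
    'Trap': ['T900'] * 3,
    'NumMosquitos': [50, 50, 20],
    'WnvPresent': [1, 1, 1],
})

# Collapse the split rows: sum the counts, mark WNV present if any row was positive
agg = df.groupby(['Date', 'Trap'], as_index=False).agg(
    NumMosquitos=('NumMosquitos', 'sum'),
    WnvPresent=('WnvPresent', 'max'),
)
print(agg)
```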
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_train.columns[11]}</b> - {df_train.WnvPresent.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_train.WnvPresent.value_counts(normalize=True)}\n')
print()
n_nan = df_train.WnvPresent.isnull().sum()
if n_nan > 0:
print(f'{df_train.WnvPresent} has {n_nan} NaNs')
log()
set1 = df_train[df_train['WnvPresent'] == 0]['NumMosquitos']
set2 = df_train[df_train['WnvPresent'] == 1]['NumMosquitos']
t, p = stats.ttest_ind(set1, set2)
print(set1.mean(), set2.mean(), t, p)
Only a small proportion of the mosquito records test positive for WNV.
The dataset is highly imbalanced: WNV is present in only about 5% of the observations.
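Given the roughly 5% positive rate, it is worth stratifying any train/validation split so both sides keep the same class balance; a sketch with synthetic labels standing in for WnvPresent:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 5% positive rate, standing in for WnvPresent
y = np.array([1] * 5 + [0] * 95)
X = np.arange(100).reshape(-1, 1)

# stratify=y preserves the ~5% positive rate in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(y_tr.mean(), y_va.mean())  # both ~0.05
```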
log()
# baseline score
df_train.WnvPresent.value_counts(normalize = True)
Only about 5% of the training observations test positive for West Nile virus. While this is good news for public health, it makes the dataset imbalanced.
log()
# Get labels
labels = df_train.WnvPresent.values
log()
# display df_train distribution
mapdata = plt.imread('../images/map1.png')
fig, ax = plt.subplots(figsize = (16,9))
temp_df=df_train[df_train['WnvPresent']==0]
ax.scatter(temp_df.Longitude, temp_df.Latitude, zorder=1, c = 'lime', s=300, label = 'Trap')
temp_df=df_train[df_train['WnvPresent']==1]
ax.scatter(temp_df.Longitude, temp_df.Latitude, zorder=1, c='r', s=400,marker='+', label= 'WNV Present')
ax.set_title('Plot of Trap locations and WNV Present locations in train dataset', size = 20)
BBox=(-88,-87.5,41.7,42.05)
ax.set_xlim(BBox[0],BBox[1])
ax.set_ylim(BBox[2],BBox[3])
ax.imshow(mapdata, zorder=0, extent = BBox, aspect= 'equal');
ax.legend()
Overview:
log()
# Check the shape of df_test
print(f'Testing dataset has {df_test.shape[0]} observations and {df_test.shape[1]} features')
# Check for missing values
print(f'Testing dataset has {df_test.isnull().sum().sum()} NaNs')
# Check for duplicates (Id is unique per row, so exclude it)
df_columns = (list(df_test.columns))
df_columns.remove('Id')
print(f'Testing dataset has {df_test[df_test.duplicated(df_columns)].shape[0]} duplicated observations')
The testing dataset is larger than the training dataset.
log()
df_test.columns
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_test.columns[1]}</b> - {df_test.Date.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_test.Date.value_counts(normalize=True)}\n')
print()
n_nan = df_test.Date.isnull().sum()
if n_nan > 0:
print(f'{df_test.Date} has {n_nan} NaNs')
log()
df_test[df_test['Date'] == '2012-07-09']['Address'].value_counts()
The date with the most trap records in the testing dataset is 2012-07-09, and the largest share of those records came from O'Hare Airport.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_test.columns[2]}</b> - {df_test.Address.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_test.Address.value_counts(normalize=True)}\n')
print()
n_nan = df_test.Address.isnull().sum()
if n_nan > 0:
print(f'{df_test.Address} has {n_nan} NaNs')
The highest number of entries recorded was from O'Hare International Airport.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_test.columns[3]}</b> - {df_test.Species.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_test.Species.value_counts(normalize=True)}\n')
print()
n_nan = df_test.Species.isnull().sum()
if n_nan > 0:
print(f'{df_test.Species} has {n_nan} NaNs')
Culex Pipiens and Culex Restuans together make up the highest proportion of species. However, the other species are far more evenly represented in this test dataset than in the training dataset.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_test.columns[4]}</b> - {df_test.Block.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_test.Block.value_counts(normalize=True)}\n')
print()
n_nan = df_test.Block.isnull().sum()
if n_nan > 0:
print(f'{df_test.Block} has {n_nan} NaNs')
In the testing dataset, there is a more even spread of data amongst the blocks. Though block 10 is still the largest, it is not as far ahead of the next most common block as in the training dataset.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_test.columns[5]}</b> - {df_test.Street.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_test.Street.value_counts(normalize=True)}\n')
print()
n_nan = df_test.Street.isnull().sum()
if n_nan > 0:
print(f'{df_test.Street} has {n_nan} NaNs')
log()
df_test[df_test['Street']== ' N OAK PARK AVE']
There is also a relatively even spread of test data across streets. The only street that remains in the top 5 entries is S Stony Island Ave, possibly because the test dataset is more balanced.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_test.columns[6]}</b> - {df_test.Trap.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_test.Trap.value_counts(normalize=True)}\n')
print()
n_nan = df_test.Trap.isnull().sum()
if n_nan > 0:
print(f'{df_test.Trap} has {n_nan} NaNs')
Like the street addresses, the traps appear in relatively even proportions, which could also be due to the more balanced dataset.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_test.columns[7]}</b> - {df_test.AddressNumberAndStreet.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_test.AddressNumberAndStreet.value_counts(normalize=True)}\n')
print()
n_nan = df_test.AddressNumberAndStreet.isnull().sum()
if n_nan > 0:
print(f'{df_test.AddressNumberAndStreet} has {n_nan} NaNs')
This differs slightly from the Street feature examined earlier, which did not include the street number.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_test.columns[8]}</b> - {df_test.Latitude.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_test.Latitude.value_counts(normalize=True)}\n')
print()
n_nan = df_test.Latitude.isnull().sum()
if n_nan > 0:
print(f'{df_test.Latitude} has {n_nan} NaNs')
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_test.columns[9]}</b> - {df_test.Longitude.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_test.Longitude.value_counts(normalize=True)}\n')
print()
n_nan = df_test.Longitude.isnull().sum()
if n_nan > 0:
print(f'{df_test.Longitude} has {n_nan} NaNs')
Similarly to the training dataset, the highest proportion of records belongs to the latitude and longitude pair 41.974689, -87.890615, which is at O'Hare Airport.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_test.columns[10]}</b> - {df_test.AddressAccuracy.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_test.AddressAccuracy.value_counts(normalize=True)}\n')
print()
n_nan = df_test.AddressAccuracy.isnull().sum()
if n_nan > 0:
print(f'{df_test.AddressAccuracy} has {n_nan} NaNs')
log()
df_test[df_test['AddressAccuracy']== 3]['AddressNumberAndStreet']
Again, about 67% of the addresses have an accuracy score of 8 or 9, while the rest have lower address accuracy.
Overview:
log()
# Check the shape of df_spray
print(f'Spray dataset has {df_spray.shape[0]} observations and {df_spray.shape[1]} features')
# Check for missing values
print(f'Spray dataset has {df_spray.isnull().sum().sum()} NaNs')
# Check for duplicates
print(f'Spray dataset has {df_spray[df_spray.duplicated(list(df_spray.columns))].shape[0]} duplicated observations')
The dataset has 14835 observations with 584 NaNs and 541 duplicated observations.
log()
df_spray.columns
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_spray.columns[0]}</b> - {df_spray.Date.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_spray.Date.value_counts(normalize=True)}\n')
print()
n_nan = df_spray.Date.isnull().sum()
if n_nan > 0:
print(f'{df_spray.Date} has {n_nan} NaNs')
The bulk of the sprays took place in 2013, while there were only two spray sessions in 2011.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_spray.columns[1]}</b> - {df_spray.Time.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_spray.Time.value_counts(normalize=True)}\n')
print()
n_nan = df_spray.Time.isnull().sum()
if n_nan > 0:
print(f'{df_spray.Time} has {n_nan} NaNs')
log()
df_spray[df_spray['Time'] == "7:44:32 PM"]
The most repeated timestamp is 7:44:32 PM. On closer inspection, it is unclear whether this is a logging error, as 541 near-identical rows share it; it may instead represent a sprayed area centred on 41.98646, -87.794225.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_spray.columns[2]}</b> - {df_spray.Latitude.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_spray.Latitude.value_counts(normalize=True)}\n')
print()
n_nan = df_spray.Latitude.isnull().sum()
if n_nan > 0:
print(f'{df_spray.Latitude} has {n_nan} NaNs')
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_spray.columns[3]}</b> - {df_spray.Longitude.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_spray.Longitude.value_counts(normalize=True)}\n')
print()
n_nan = df_spray.Longitude.isnull().sum()
if n_nan > 0:
print(f'{df_spray.Longitude} has {n_nan} NaNs')
The most frequent latitude and longitude pair in this dataset is 41.98646, -87.794225, where most of the 7:44 PM sprays occurred; this may represent a sprayed area centred on that point.
log()
mapdata = plt.imread('../images/map1.png')
fig, ax = plt.subplots(figsize = (16,9))
ax.scatter(df_spray.Longitude, df_spray.Latitude, zorder=1, c= 'g', s=150)
ax.set_title('Plot of Spray Locations', size = 20)
BBox=(-88,-87.5,41.7,42.05)
ax.set_xlim(BBox[0],BBox[1])
ax.set_ylim(BBox[2],BBox[3])
ax.imshow(mapdata, zorder=0, extent = BBox, aspect= 'equal');
As shown in the map above, only a small subset of places were chosen for spraying. In addition, the places chosen were not demarcated by suburb borders.
In addition, since this dataset has a large proportion of duplicates and possibly erroneous entries, it is less useful as an input to the model.
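If the spray data were to be used anyway, a minimal cleaning step would be to drop the exact duplicate rows first; a sketch with toy rows shaped like df_spray (values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking df_spray: a duplicated row and a NaN in Time
spray = pd.DataFrame({
    'Date': ['2013-07-17', '2013-07-17', '2011-09-07'],
    'Time': ['7:44:32 PM', '7:44:32 PM', np.nan],
    'Latitude': [41.98646, 41.98646, 41.95],
    'Longitude': [-87.794225, -87.794225, -87.80],
})

# Drop exact duplicates; keep NaN-Time rows since date/location remain usable
cleaned = spray.drop_duplicates().reset_index(drop=True)
print(cleaned.shape)
```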
Overview:
log()
# Check the shape of df_weather
print(f'Weather dataset has {df_weather.shape[0]} observations and {df_weather.shape[1]} features')
# Check for missing values
print(f'Weather dataset has {df_weather.isnull().sum().sum()} NaNs')
# Check for duplicates
print(f'Weather dataset has {df_weather[df_weather.duplicated(list(df_weather.columns))].shape[0]} duplicated observations')
The weather dataset has 2944 observations and no duplicates.
log()
df_weather.columns
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[0]}</b> - {df_weather.Station.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.Station.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.Station.isnull().sum()
if n_nan > 0:
print(f'{df_weather.Station} has {n_nan} NaNs')
The observations are split exactly in half between station 1 and station 2.
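Since the two stations each report once per date, one common way to get a single citywide reading is to average them per date; a sketch with toy rows shaped like df_weather (values are made up):

```python
import pandas as pd

# Toy weather rows: two stations report once per date, as in df_weather
weather = pd.DataFrame({
    'Station': [1, 2, 1, 2],
    'Date': ['2007-05-01', '2007-05-01', '2007-05-02', '2007-05-02'],
    'Tmax': [83, 84, 59, 60],
})

# Average the two stations to get one citywide reading per date
daily = weather.groupby('Date', as_index=False)['Tmax'].mean()
print(daily)
```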
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[1]}</b> - {df_weather.Date.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.Date.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.Date.isnull().sum()
if n_nan > 0:
print(f'{df_weather.Date} has {n_nan} NaNs')
The weather dataset shows an even spread of all dates.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[2]}</b> - {df_weather.Tmax.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.Tmax.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.Tmax.isnull().sum()
if n_nan > 0:
print(f'{df_weather.Tmax} has {n_nan} NaNs')
log()
plt.figure(figsize = (16,9))
plt.hist(df_weather['Tmax'], bins = 20)
plt.title('Distribution of Max Temperature data', size = 15)
plt.xlabel('Temperature')
plt.ylabel('Count')
log()
print(df_weather['Tmax'].mean(), df_weather['Tmax'].min(), df_weather['Tmax'].max())
During the period of May to October, the average Tmax is about 76 degrees, while the min-max range is 41 to 104 degrees. However, the highest proportion of temperatures lies around the 79 to 84 mark; with the mode above the mean, the distribution of Tmax is skewed toward lower temperatures (negatively skewed).
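Skew direction can also be checked numerically rather than read off a histogram: scipy's sample skewness is negative when the tail points toward lower values. A sketch with an illustrative sample (not the actual Tmax data):

```python
import numpy as np
import scipy.stats as stats

# Illustrative sample: values cluster high with a long tail toward low values,
# loosely mimicking the Tmax shape described above
sample = np.array([41, 60, 70, 75, 79, 80, 81, 82, 83, 84])
print(stats.skew(sample))  # negative => tail toward lower values
```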
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[3]}</b> - {df_weather.Tmin.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.Tmin.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.Tmin.isnull().sum()
if n_nan > 0:
print(f'{df_weather.Tmin} has {n_nan} NaNs')
log()
plt.figure(figsize = (16,9))
plt.hist(df_weather['Tmin'], bins = 20)
plt.title('Distribution of Min Temperature data', size = 15)
plt.xlabel('Temperature')
plt.ylabel('Count')
log()
print(df_weather['Tmin'].mean(), df_weather['Tmin'].min(), df_weather['Tmin'].max())
The average Tmin is 57 degrees, with a min-max range of 29 to 83 degrees. However, the larger proportion of temperatures lies between 60 and 68 degrees; with the mode above the mean, this suggests the Tmin data is slightly skewed toward lower values (negatively skewed).
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[4]}</b> - {df_weather.Tavg.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.Tavg.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.Tavg.isnull().sum()
if n_nan > 0:
print(f'{df_weather.Tavg} has {n_nan} NaNs')
log()
df_weather.Tavg.unique()
df_weather[df_weather['Tavg']== 'M']['Tavg'].count()
log()
df_weather['Tavg'] = df_weather['Tavg'].replace('M', np.nan)
df_weather['Tavg'] = [float(x) for x in df_weather['Tavg']]
df_weather['Tavg'].unique()
log()
plt.figure(figsize = (16,9))
plt.hist(df_weather['Tavg'], bins = 20)
plt.title('Distribution of Tavg Temperature data', size = 15)
plt.xlabel('Temperature', size = 15)
plt.ylabel('Count', size = 15)
log()
print(df_weather['Tavg'].mean(), df_weather['Tavg'].min(), df_weather['Tavg'].max())
The Tavg column contains 11 missing values (recorded as 'M'). Of the valid data, the highest proportion of Tavg lies between 70 and 77 degrees, while the mean is 67 degrees; with the mode above the mean, the distribution is skewed toward lower temperatures (negatively skewed).
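The replace('M', np.nan)-then-cast pattern recurs for several weather columns; a hypothetical helper using pd.to_numeric(errors='coerce') would do the same in one step (the helper name is ours, not from the original code):

```python
import pandas as pd

def to_numeric_missing(series: pd.Series) -> pd.Series:
    # 'M' (and any other non-numeric token) becomes NaN; numbers become floats,
    # matching the replace('M', np.nan) + float() casting used above
    return pd.to_numeric(series, errors='coerce')

col = pd.Series(['67', 'M', '72'])
converted = to_numeric_missing(col)
print(converted)
```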
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[5]}</b> - {df_weather.Depart.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.Depart.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.Depart.isnull().sum()
if n_nan > 0:
print(f'{df_weather.Depart} has {n_nan} NaNs')
The departure-from-normal feature has a large proportion of missing values, which is grounds for dropping the column.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[6]}</b> - {df_weather.DewPoint.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.DewPoint.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.DewPoint.isnull().sum()
if n_nan > 0:
print(f'{df_weather.DewPoint} has {n_nan} NaNs')
log()
print(df_weather['DewPoint'].mean(), df_weather['DewPoint'].min(), df_weather['DewPoint'].max())
log()
plt.figure(figsize = (16,9))
plt.hist(df_weather['DewPoint'], bins = 20)
plt.title('Distribution of Dewpoint temperature data', size = 15)
plt.xlabel('Temperature')
plt.ylabel('Count')
The dewpoint data has a mean of 53 degrees and a mode between 55 and 59 degrees; with the mode above the mean, the distribution of DewPoint is negatively skewed.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[7]}</b> - {df_weather.WetBulb.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.WetBulb.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.WetBulb.isnull().sum()
if n_nan > 0:
print(f'{df_weather.WetBulb} has {n_nan} NaNs')
log()
df_weather[df_weather['WetBulb'] == 'M']['WetBulb'].count()
log()
df_weather['WetBulb'] = df_weather['WetBulb'].replace('M', np.nan)
df_weather['WetBulb'] = [float(x) for x in df_weather['WetBulb']]
log()
plt.figure(figsize = (16,9))
plt.hist(df_weather['WetBulb'], bins = 30)
plt.title('Distribution of Wetbulb temperature data', size = 15)
plt.xlabel('Temperature', size = 15)
plt.ylabel('Count', size = 15)
log()
print(df_weather['WetBulb'].mean(), df_weather['WetBulb'].min(),
df_weather['WetBulb'].max(), df_weather['WetBulb'].median())
There are 4 missing values in the WetBulb data. The mean is 59 degrees and the larger proportion of the data lies in the 59 to 65 degree range; the histogram suggests a slight skew toward lower values.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[8]}</b> - {df_weather.Heat.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.Heat.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.Heat.isnull().sum()
if n_nan > 0:
    print(f'Heat has {n_nan} NaNs')
log()
df_weather['Heat'] = df_weather['Heat'].replace('M', np.nan).astype(float)
df_weather.info()
log()
plt.figure(figsize = (16,9))
plt.hist(df_weather['Heat'], bins = 30)
plt.title('Distribution of Heat temperature data', size = 15)
plt.xlabel('Temperature', size = 15)
plt.ylabel('Count', size = 15)
There are 11 missing values in the Heat data. About 60% of the values are 0, so the distribution has a large spike at the minimum with a long right tail (positively skewed).
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[9]}</b> - {df_weather.Cool.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.Cool.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.Cool.isnull().sum()
if n_nan > 0:
    print(f'Cool has {n_nan} NaNs')
log()
df_weather[df_weather['Cool'] == 'M']['Cool'].count()
log()
df_weather['Cool'] = df_weather['Cool'].replace('M', np.nan).astype(float)
log()
plt.figure(figsize = (16,9))
plt.hist(df_weather['Cool'], bins = 30)
plt.title('Distribution of Cool temperature data', size = 15)
plt.xlabel('Temperature', size = 15)
plt.ylabel('Count', size = 15)
There are 11 missing Cool values. Similarly, a large share of the values are 0, so a spike at the minimum dominates and skews the distribution.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[10]}</b> - {df_weather.Sunrise.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.Sunrise.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.Sunrise.isnull().sum()
if n_nan > 0:
    print(f'Sunrise has {n_nan} NaNs')
The sunrise data has a large proportion of missing values, so it is not advisable to use it. Of the recorded values, most sunrises occurred around 4 am.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[11]}</b> - {df_weather.Sunset.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.Sunset.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.Sunset.isnull().sum()
if n_nan > 0:
    print(f'Sunset has {n_nan} NaNs')
The sunset data has a large proportion of missing values, so it is not advisable to use it. Of the recorded values, most sunsets occurred around 7.30 pm.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[12]}</b> - {df_weather.CodeSum.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.CodeSum.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.CodeSum.isnull().sum()
if n_nan > 0:
    print(f'CodeSum has {n_nan} NaNs')
A large proportion of this column is empty. However, this does not represent missing data: CodeSum records any weather phenomena that occurred, so an empty entry means no phenomena were observed that day.
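To separate "no phenomena" blanks from genuinely missing entries, the blanks can be counted explicitly; a sketch on toy values (assuming blanks are stored as whitespace strings, as in the raw file):

```python
import pandas as pd

# toy CodeSum column: blanks mean "no phenomena", None is truly missing
codesum = pd.Series(['RA', ' ', ' ', 'TSRA', None])

print((codesum.str.strip() == '').sum())  # days with no phenomena recorded
print(codesum.isnull().sum())             # genuinely missing entries
```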
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[13]}</b> - {df_weather.Depth.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.Depth.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.Depth.isnull().sum()
if n_nan > 0:
    print(f'Depth has {n_nan} NaNs')
This column contains no useful information: half the entries are 0 and the other half are missing. This is to be expected, as Depth records snow depth, which should not occur in summer.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[14]}</b> - {df_weather.Water1.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.Water1.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.Water1.isnull().sum()
if n_nan > 0:
    print(f'Water1 has {n_nan} NaNs')
This column contains only missing values; it records the water volume equivalent of fallen snow.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[15]}</b> - {df_weather.SnowFall.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.SnowFall.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.SnowFall.isnull().sum()
if n_nan > 0:
    print(f'SnowFall has {n_nan} NaNs')
log()
df_weather[(df_weather['SnowFall'] == '0.1')| (df_weather['SnowFall'] == ' T')]['SnowFall'].value_counts()
There are 12 entries with trace amounts of snow and 1 with 0.1 inches of snow. In addition, about 50% of the entries are missing.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[16]}</b> - {df_weather.PrecipTotal.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.PrecipTotal.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.PrecipTotal.isnull().sum()
if n_nan > 0:
    print(f'PrecipTotal has {n_nan} NaNs')
log()
df_weather['PrecipTotal'].unique()
log()
print('missing ', df_weather[df_weather['PrecipTotal']== 'M']['PrecipTotal'].count(),
'trace ', df_weather[df_weather['PrecipTotal']== ' T']['PrecipTotal'].count())
There are 2 missing entries and 318 with trace amounts of precipitation.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[17]}</b> - {df_weather.StnPressure.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.StnPressure.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.StnPressure.isnull().sum()
if n_nan > 0:
    print(f'StnPressure has {n_nan} NaNs')
log()
df_weather['StnPressure'].unique()
log()
df_weather[df_weather['StnPressure'] == 'M']['StnPressure'].count()
log()
df_weather['StnPressure'] = df_weather['StnPressure'].replace('M', np.nan).astype(float)
log()
plt.figure(figsize = (16,9))
plt.hist(df_weather['StnPressure'], bins = 30)
plt.title('Distribution of Station Pressure data', size = 15)
plt.xlabel('Pressure', size = 15)
plt.ylabel('Count', size = 15)
log()
print(df_weather['StnPressure'].mean(), df_weather['StnPressure'].min(),
df_weather['StnPressure'].max(), df_weather['StnPressure'].median())
There are 4 missing entries in the StnPressure data. The data looks normally distributed, and the mean lies close to the median.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[18]}</b> - {df_weather.SeaLevel.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.SeaLevel.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.SeaLevel.isnull().sum()
if n_nan > 0:
    print(f'SeaLevel has {n_nan} NaNs')
log()
df_weather[df_weather['SeaLevel'] == 'M']['SeaLevel'].count()
log()
df_weather['SeaLevel'] = df_weather['SeaLevel'].replace('M', np.nan).astype(float)
log()
plt.figure(figsize = (16,9))
plt.hist(df_weather['SeaLevel'], bins = 30)
plt.title('Distribution of Sea level data', size = 15)
plt.xlabel('Sea level', size = 15)
plt.ylabel('Count', size = 15)
log()
print(df_weather['SeaLevel'].mean(), df_weather['SeaLevel'].min(),
df_weather['SeaLevel'].max(), df_weather['SeaLevel'].median())
There are 9 missing entries in the SeaLevel data. The data looks normally distributed, with the median lying close to the mean.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[19]}</b> - {df_weather.ResultSpeed.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.ResultSpeed.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.ResultSpeed.isnull().sum()
if n_nan > 0:
    print(f'ResultSpeed has {n_nan} NaNs')
log()
df_weather[df_weather['ResultSpeed'] == 'M']['ResultSpeed'].count()
log()
plt.figure(figsize = (16,9))
plt.hist(df_weather['ResultSpeed'], bins = 30)
plt.title('Distribution of Result Wind Speed data', size = 15)
plt.xlabel('Speed', size = 15)
plt.ylabel('Count', size = 15)
log()
print(df_weather['ResultSpeed'].mean(), df_weather['ResultSpeed'].min(),
df_weather['ResultSpeed'].max(), df_weather['ResultSpeed'].median())
The resultant wind speed data shows a slight negative skew.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[20]}</b> - {df_weather.ResultDir.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.ResultDir.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.ResultDir.isnull().sum()
if n_nan > 0:
    print(f'ResultDir has {n_nan} NaNs')
log()
df_weather[df_weather['ResultDir'] == 'M']['ResultDir'].count()
log()
plt.figure(figsize = (16,9))
plt.hist(df_weather['ResultDir'], bins = 30)
plt.title('Distribution of Result Wind Direction data', size = 15)
plt.xlabel('Degrees', size = 15)
plt.ylabel('Count', size = 15)
The resultant wind direction is not normally distributed, and there are no missing values.
log()
# Display object type and values for each feature
display(Markdown(f'<b>{df_weather.columns[21]}</b> - {df_weather.AvgSpeed.dtype}'))
display(Markdown(f'Values:'))
print(f'{df_weather.AvgSpeed.value_counts(normalize=True)}\n')
print()
n_nan = df_weather.AvgSpeed.isnull().sum()
if n_nan > 0:
    print(f'AvgSpeed has {n_nan} NaNs')
log()
df_weather[df_weather['AvgSpeed'] == 'M']['AvgSpeed'].count()
log()
df_weather['AvgSpeed'] = df_weather['AvgSpeed'].replace('M', np.nan).astype(float)
log()
plt.figure(figsize = (16,9))
plt.hist(df_weather['AvgSpeed'], bins = 30)
plt.title('Distribution of Average Wind Speed data', size = 15)
plt.xlabel('Speed', size = 15)
plt.ylabel('Count', size = 15)
log()
print(df_weather['AvgSpeed'].mean(), df_weather['AvgSpeed'].min(),
df_weather['AvgSpeed'].max(), df_weather['AvgSpeed'].median())
The mean of the average wind speed is 8.5, and the distribution is negatively skewed.
log()
df = pd.merge(df_train, df_weather, how="inner", on="Date")
df = df.replace(['M', '-', 'T', ' T'], -1)
log()
# convert the weather columns that were read in as strings to numeric
numeric_cols = ['Tavg', 'Depart', 'WetBulb', 'Cool', 'Sunrise', 'Sunset',
                'Depth', 'SnowFall', 'PrecipTotal', 'StnPressure',
                'SeaLevel', 'AvgSpeed']
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col])
log()
df.head()
log()
single_barplot(df=df,x='WetBulb',
y='WnvPresent',
title='Nile Virus is detected Vs Wet Bulb temperature',
xlab='Wet Bulb temperature',
ylab='West Nile Virus detections')
West Nile Virus is detected more frequently during days with a higher wet bulb temperature.
log()
single_barplot(df=df,
x='Cool',
y='WnvPresent',
title='Nile Virus is detected Vs Cooling temperature',
xlab='Cooling temperature',
ylab='West Nile Virus detections')
West Nile Virus is detected more frequently on days with Cool (cooling degree) values between 5 and 21.
log()
single_barplot(df=df,
x='PrecipTotal',
y='WnvPresent',
title='Nile Virus is detected Vs Total precipitation',
xlab='Total precipitation',
ylab='West Nile Virus detections')
West Nile Virus is detected more frequently during days with 0 precipitation total.
log()
single_barplot(df=df,
x='Tavg',
y='WnvPresent',
title='Nile Virus is detected Vs Average temperature',
xlab='Average temperature',
ylab='West Nile Virus detections')
West Nile Virus is detected more frequently during days with a higher average temperature.
log()
single_barplot(df=df,
x='Tmax',
y='WnvPresent',
title='Nile Virus is detected Vs Maximum temperature',
xlab='Maximum temperature',
ylab='West Nile Virus detections')
West Nile Virus is detected more frequently during days with a higher maximum temperature.
log()
single_barplot(df=df,
x='Tmin',
y='WnvPresent',
title='Nile Virus is detected Vs Minimum temperature',
xlab='Minimum temperature',
ylab='West Nile Virus detections')
West Nile Virus is detected more frequently during days with a higher minimum temperature.
log()
single_barplot(df=df,
x='Heat',
y='WnvPresent',
title='Nile Virus is detected Vs Heating temperature',
xlab='Heating temperature',
ylab='West Nile Virus detections')
West Nile Virus is detected more frequently on days with no heating (Heat = 0).
log()
single_barplot(df=df,
x='SeaLevel',
y='WnvPresent',
title='Nile Virus is detected Vs Sea Level',
xlab='Sea Level',
ylab='West Nile Virus detections')
West Nile Virus is detected more frequently when sea-level pressure lies between 29.89 and 30.13.
log()
single_barplot(df=df,
x='AvgSpeed',
y='WnvPresent',
title='Nile Virus is detected Vs Average wind speed',
xlab='Average wind speed',
ylab='West Nile Virus detections')
No significant visual relationship found between average wind speed and WNV detection.
log()
single_barplot(df=df,
x='ResultDir',
y='WnvPresent',
title='Nile Virus is detected Vs Resultant wind direction',
xlab='Resultant wind direction',
ylab='West Nile Virus detections')
No significant visual relationship found between wind direction and WNV detection.
log()
single_barplot(df=df,
x='ResultSpeed',
y='WnvPresent',
title='Nile Virus is detected Vs Resultant wind speed',
xlab='Resultant wind speed',
ylab='West Nile Virus detections')
No significant visual relationship found between resultant wind speed and WNV detection.
log()
single_barplot(df=df,
x='DewPoint',
y='WnvPresent',
title='Nile Virus is detected Vs Dew point temperature',
xlab='Dew point temperature',
ylab='West Nile Virus detections')
West Nile Virus is detected more frequently during days with a higher dew point temperature.
log()
df_train['month'] = df_train.Date.apply(create_month)
df_train['day'] = df_train.Date.apply(create_day)
df_test['month'] = df_test.Date.apply(create_month)
df_test['day'] = df_test.Date.apply(create_day)
Month and Day variables were created for both train and test dataframes.
log()
# Add integer latitude/longitude columns
df_train['Lat_int'] = df_train.Latitude.apply(int)
df_train['Long_int'] = df_train.Longitude.apply(int)
df_test['Lat_int'] = df_test.Latitude.apply(int)
df_test['Long_int'] = df_test.Longitude.apply(int)
Longitude and latitude were also converted to integers, dropping the decimal component.
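One caveat worth noting: `int()` truncates toward zero rather than flooring, which matters for Chicago's negative longitudes. A minimal illustration:

```python
import math

# int() truncates toward zero: -87.8 becomes -87, not -88
longitude = -87.8
print(int(longitude))
print(math.floor(longitude))  # floor, for comparison
```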
log()
# replace missing ('M'), '-', and trace ('T') values, then NaNs, with -1
df_weather = df_weather.replace(['M', '-', 'T', ' T'], -1)
df_weather = df_weather.fillna(-1)
We replaced all NaN, missing ('M'), and trace ('T') values with -1 so they can be handled uniformly afterwards.
log()
# drop address columns
df_train = df_train.drop(['Address', 'AddressNumberAndStreet','WnvPresent', 'NumMosquitos'], axis = 1)
df_test = df_test.drop(['Id', 'Address', 'AddressNumberAndStreet'], axis = 1)
We dropped the address-related features, as they duplicate the block and street features kept in the dataframe. We also dropped WnvPresent and NumMosquitos, as the target has already been saved into a separate labels list.
log()
# Not using codesum for this model
df_weather = df_weather.drop('CodeSum', axis=1)
CodeSum gives a qualitative description of the weather. We dropped the column because other columns (like PrecipTotal) already describe rain. In addition, it contains many empty entries, which is expected since the column only records weather phenomena.
log()
# Split station 1 and 2 and join horizontally
stn1 = df_weather[df_weather['Station']==1]
stn1 = stn1.drop('Station', axis=1)
stn2 = df_weather[df_weather['Station']==2]
stn2 = stn2.drop('Station', axis=1)
df_weather = stn1.merge(stn2, on='Date')
We merged the two stations' data into separate columns on the same date for convenience.
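A minimal sketch of this split-and-merge on toy data; note that pandas suffixes the shared column names with `_x`/`_y` by default:

```python
import pandas as pd

# toy two-station weather frame
toy = pd.DataFrame({
    'Station': [1, 2, 1, 2],
    'Date': ['2007-05-01', '2007-05-01', '2007-05-02', '2007-05-02'],
    'Tavg': [60, 62, 65, 66],
})

stn1 = toy[toy['Station'] == 1].drop('Station', axis=1)
stn2 = toy[toy['Station'] == 2].drop('Station', axis=1)

# shared column names (other than the merge key) get _x/_y suffixes
merged = stn1.merge(stn2, on='Date')
print(merged.columns.tolist())
```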
log()
# Merge with weather data
df_train = df_train.merge(df_weather, on='Date')
df_train = df_train.drop(['Date'], axis = 1)
df_test = df_test.merge(df_weather, on='Date')
df_test = df_test.drop(['Date'], axis = 1)
The weather data was merged with the training and the test data.
log()
# Convert categorical data to numbers
lbl = preprocessing.LabelEncoder()
lbl.fit(list(df_train['Species'].values) + list(df_test['Species'].values))
df_train['Species'] = lbl.transform(df_train['Species'].values)
df_test['Species'] = lbl.transform(df_test['Species'].values)
lbl.fit(list(df_train['Street'].values) + list(df_test['Street'].values))
df_train['Street'] = lbl.transform(df_train['Street'].values)
df_test['Street'] = lbl.transform(df_test['Street'].values)
lbl.fit(list(df_train['Trap'].values) + list(df_test['Trap'].values))
df_train['Trap'] = lbl.transform(df_train['Trap'].values)
df_test['Trap'] = lbl.transform(df_test['Trap'].values)
All categorical features in the training and testing sets were re-encoded using scikit-learn's LabelEncoder, fitted on the combined train and test values.
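A sketch of the union-fit pattern on illustrative species values: fitting on the combined lists guarantees the encoder has seen every category that appears in either set.

```python
from sklearn.preprocessing import LabelEncoder

# illustrative values; the real data holds more species
train_species = ['CULEX PIPIENS', 'CULEX RESTUANS']
test_species = ['CULEX PIPIENS', 'CULEX SALINARIUS']

lbl = LabelEncoder()
# fit on the union so transform never meets an unseen label
lbl.fit(train_species + test_species)
print(list(lbl.transform(test_species)))
```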
log()
# drop columns with -1s
df_train = df_train.loc[:,(df_train != -1).any(axis=0)]
df_test = df_test.loc[:,(df_test != -1).any(axis=0)]
Columns consisting entirely of -1 (i.e., fully missing) were dropped in one go.
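The column mask works like this on a toy frame: `(df != -1).any(axis=0)` is True for each column with at least one value other than -1, so indexing with it drops the all -1 columns.

```python
import pandas as pd

df = pd.DataFrame({'keep': [1, -1, 3], 'drop': [-1, -1, -1]})

# keep only columns that contain at least one value other than -1
df = df.loc[:, (df != -1).any(axis=0)]
print(df.columns.tolist())
```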
log()
df_train = df_train.drop(labels = 'state', axis = 1)
All entries have the same state, so it is dropped.
log()
# train test split
X_train, X_val, y_train, y_val = train_test_split(df_train, labels, random_state=42, stratify=labels)
The training data was split into training and validation ('test') X, y sets.
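Because positives are rare in this data, `stratify` matters: it preserves the class ratio in both splits. A toy sketch (test_size=0.2 is set here only so the counts divide evenly; the original call uses the default split size):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data with a rare positive class: 10% positives
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 10 + [0] * 90)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# both splits keep the 10% positive rate
print(y_tr.mean(), y_val.mean())
```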
A grid search was run to find which model gave the best train, validation, and ROC AUC scores.
The models tested were logistic regression ('lgr'), random forest ('rfc'), gradient boosting ('gbc'), AdaBoost ('abc'), and k-nearest neighbours ('knn'):
# grid search on best classifier
models = ['lgr','rfc','gbc','abc','knn']
idx = 0
model_solns = {}
for m in models:
log()
idx += 1
items = [m]
[train_score, test_score, roc_auc] = pipeline(items, X_train, X_val, y_train, y_val)
model_solns[idx] = {'model': m,
'train_score': train_score, 'test_score': test_score, 'roc_auc_score': roc_auc,
}
# find the best classifier
df_solns = pd.DataFrame(model_solns).sort_values(ascending=False, by='roc_auc_score',axis=1)
display(df_solns)
The best model was the Gradient Boosting Classifier, with the highest test and ROC AUC score of 0.8299.
We then refit the model with GridSearchCV using cv = 3; verbose = 1 was set simply to monitor progress.
log()
# perform gradient boosting classifier
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

pipe = Pipeline([
    ('gbc', GradientBoostingClassifier()),
])
params_grid_cv = {
}
gs = GridSearchCV(pipe, param_grid=params_grid_cv, scoring='roc_auc', verbose=1, cv=3)
gs.fit(df_train, labels)
# log()
# # create predictions and submission file
# predictions = gs.predict_proba(df_test)[:,1]
# df_sample['WnvPresent'] = predictions
# now = str(datetime.now()).replace(':', '_').replace('.', '_').replace('-', '_').replace(' ', '_')
# df_sample.to_csv(f'../data/model_preds_{now}.csv', index = False)
# print(f'Exported to ../data/model_preds_{now}.csv')
Team's best Kaggle Score:
log()
# importing in the predictions
df_sample = pd.read_csv('../data/model_preds_2019_11_07_17_18_06_258582.csv')
df_sample
log()
#bringing back in the test data
df_conclude = pd.read_csv('../data/test.csv')
log()
#merging the test data and predictions
df_conclude = df_conclude.merge(df_sample)
log()
# locating test data with high probability of having WNV presence
df_conclude = df_conclude.loc[df_conclude['WnvPresent'] >= 0.5]
log()
# saving out
df_conclude.to_csv('../data/conclude.csv')
We look at the points where the predicted probability of WNV presence is high.
log()
# plot a map to show where the probability of WNV presence is above 0.5
concl=pd.read_csv('../data/conclude.csv')
mapdata = plt.imread('../images/map1.png')
fig, ax = plt.subplots(figsize = (16,9))
ax.scatter(df_test.Longitude, df_test.Latitude, zorder=1, c='b', s=300, label = 'Traps')
ax.scatter(concl.Longitude, concl.Latitude, zorder=1, c='r', s=300, label = 'WNV Present')
ax.set_title('Plot of Test locations and WNV high probability locations', size = 15)
BBox=(-88,-87.5,41.7,42.05)
ax.set_xlim(BBox[0],BBox[1])
ax.set_ylim(BBox[2],BBox[3])
ax.imshow(mapdata, zorder=0, extent = BBox, aspect= 'equal');
ax.legend()
We set a threshold of 0.5 to keep only the locations with a high predicted probability of WNV spreading. Of all the predicted locations, only 50 exceeded this threshold.
log()
spray_target=list(concl.Date.unique())
spray_target
log()
# Plot of WNV predicted locations for 2008 September
temp_df=concl[concl['Date'].str.contains(spray_target[0])]
mapdata = plt.imread('../images/map1.png')
fig, ax = plt.subplots(figsize = (16,9))
ax.scatter(temp_df.Longitude, temp_df.Latitude, zorder=1, c='r', s=300)
ax.set_title('Plot of WNV predicted locations for 2008 September', size = 15)
BBox=(-88,-87.5,41.7,42.05)
ax.set_xlim(BBox[0],BBox[1])
ax.set_ylim(BBox[2],BBox[3])
ax.imshow(mapdata, zorder=0, extent = BBox, aspect= 'equal');
log()
#Plot of WNV predicted locations for 2010 August
temp_df=concl[concl['Date'].str.contains(spray_target[1])]
mapdata = plt.imread('../images/map1.png')
fig, ax = plt.subplots(figsize = (16,9))
ax.scatter(temp_df.Longitude, temp_df.Latitude, zorder=1, c='r', s=300)
ax.set_title('Plot of WNV predicted locations for 2010 August', size = 15)
BBox=(-88,-87.5,41.7,42.05)
ax.set_xlim(BBox[0],BBox[1])
ax.set_ylim(BBox[2],BBox[3])
ax.imshow(mapdata, zorder=0, extent = BBox, aspect= 'equal');
log()
# Plot of WNV predicted locations for 2012 August
temp_df=concl[concl['Date'].str.contains(spray_target[2])]
mapdata = plt.imread('../images/map1.png')
fig, ax = plt.subplots(figsize = (16,9))
ax.scatter(temp_df.Longitude, temp_df.Latitude, zorder=1, c='r', s=300)
ax.set_title('Plot of WNV predicted locations for 2012 August', size = 15)
BBox=(-88,-87.5,41.7,42.05)
ax.set_xlim(BBox[0],BBox[1])
ax.set_ylim(BBox[2],BBox[3])
ax.imshow(mapdata, zorder=0, extent = BBox, aspect= 'equal');
log()
temp_df=concl[concl['Date'].str.contains(spray_target[3])]
mapdata = plt.imread('../images/map1.png')
fig, ax = plt.subplots(figsize = (16,9))
ax.scatter(temp_df.Longitude, temp_df.Latitude, zorder=1, c='r', s=300)
ax.set_title('Plot of WNV predicted locations for 2012 September', size = 15)
BBox=(-88,-87.5,41.7,42.05)
ax.set_xlim(BBox[0],BBox[1])
ax.set_ylim(BBox[2],BBox[3])
ax.imshow(mapdata, zorder=0, extent = BBox, aspect= 'equal');
log()
# Plot of WNV predicted locations for 2014 August
temp_df=concl[concl['Date'].str.contains(spray_target[4])]
mapdata = plt.imread('../images/map1.png')
fig, ax = plt.subplots(figsize = (16,9))
ax.scatter(temp_df.Longitude, temp_df.Latitude, zorder=1, c='r', s=300)
ax.set_title('Plot of WNV predicted locations for 2014 August', size = 15)
BBox=(-88,-87.5,41.7,42.05)
ax.set_xlim(BBox[0],BBox[1])
ax.set_ylim(BBox[2],BBox[3])
ax.imshow(mapdata, zorder=0, extent = BBox, aspect= 'equal');
As shown, all the plots show a heavy concentration of traps near the O'Hare Airport area with a high probability of carrying the WNV vector.
Merging Weather Data with Spray Targets
Station 1 data was chosen because it is the closest to all trap areas.
#converting the Date into Datetime variable
stn1['Date'] = [datetime.strptime(x, '%Y-%m-%d') for x in stn1['Date']]
# only temperature, precipitation and sea-level data were chosen, as
# a visual relationship with WNV was found earlier on
edited_stn1 = stn1[['Date', 'Tmax', 'Tmin', 'Tavg', 'DewPoint', 'WetBulb', 'Heat',
'Cool', 'PrecipTotal', 'SeaLevel']]
edited_stn1.info()
#Set date as index
edited_stn1.set_index('Date', inplace = True)
# looking at all the precipitation values
edited_stn1.PrecipTotal.unique()
# replacing missing values with 0
edited_stn1['PrecipTotal'] = edited_stn1['PrecipTotal'].replace((-1),0) #assuming no data means no rain
edited_stn1.PrecipTotal.unique()
#changing the type of data in PrecipTotal
edited_stn1["PrecipTotal"] = [float(x) for x in edited_stn1["PrecipTotal"]]
#making rain status
edited_stn1["rainstat"] = [1 if x >= 0.1 else 0 for x in edited_stn1["PrecipTotal"]]
#changing the date in concl dataframe into Datetime variable
df_conclude['Date'] = [datetime.strptime(x, '%Y-%m-%d') for x in df_conclude['Date']]
df_conclude['Date'].dtype
# rolling sums of the weather data for windows of 2 to 30 days
rolling_sums = {n: edited_stn1.rolling(n).sum() for n in range(2, 31)}

# creating a list of the rolling-sum window labels
sumweatherlist = [f'sum{n}' for n in range(2, 31)]
print(sumweatherlist)

# encoding the rolling rainstat sums into the df_conclude dataframe,
# matching each row's date against the rolling-sum index
for n in range(2, 31):
    rainstat_dict = rolling_sums[n]['rainstat'].to_dict()
    df_conclude[f'rainstat{n}'] = [rainstat_dict[x] for x in df_conclude['Date']]

# creating a list of the newly encoded precipitation columns
rainstatlist = [f'rainstat{n}' for n in range(2, 31)]
print(rainstatlist)
# function created to do correlation between the WNV probability and rainstat mean
pearson(rainstatlist, df_conclude)
#making a column of WNV probability >0.5
df_conclude['wmv0.5'] = [1 if x >= 0.5 else 0 for x in df_conclude['WnvPresent']]
rainsum_result = {}
def ttesttaker(df, lst, resultdict):
    """
    Runs an independent two-sample t-test on each column against WNV status.

    Parameters
    ----------
    df: dataframe of concern
    lst: list of columns in the dataframe
    resultdict: dictionary to place the results in

    Returns
    -------
    resultdict: the mean of each group, the t-statistic and the p value
    """
    for a in lst:
        group1 = df[df['wmv0.5'] == 0][a]
        group2 = df[df['wmv0.5'] == 1][a]
        t, p = stats.ttest_ind(group1, group2)
        resultdict[a] = {'0': group1.mean(), '1': group2.mean(),
                         't': t, 'p': round(p, 4)}
ttesttaker(df_conclude, rainstatlist, rainsum_result)
#dictionary of results
rainsum_result
#results in dictionary
rainstat_df = pd.DataFrame(rainsum_result).T
rainstat_df
A higher probability of WNV presence (> 0.5) is linked to fewer days of rain over rolling windows of 14 days and longer.
Rolling sums of fewer than 7 days were ignored, as mosquitos require at least 7 days to reach adulthood.
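The rolling-sum trick used above can be seen on a toy 0/1 rain indicator: `rolling(n).sum()` counts rain days in each trailing n-day window, with NaN for the first n-1 entries.

```python
import pandas as pd

# toy daily rain indicator: 1 = rain day, 0 = dry day
rain = pd.Series([1, 0, 0, 1, 1, 0, 1])

# number of rain days in each trailing 3-day window
counts = rain.rolling(3).sum()
print(counts.tolist())
```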
log()
# plot a map showing the traps that should be sprayed (WNV probability above 0.5)
concl=pd.read_csv('../data/conclude.csv')
mapdata = plt.imread('../images/map1.png')
fig, ax = plt.subplots(figsize = (16,9))
ax.scatter(df_test.Longitude, df_test.Latitude, zorder=1, c='b', s=300, label = 'Traps')
ax.scatter(concl.Longitude, concl.Latitude, marker='*', zorder=1, c='r', s=1500, label = 'Spray Region')
ax.set_title('Plot of Test locations and WNV high probability locations', size = 15)
BBox=(-88,-87.5,41.7,42.05)
ax.set_xlim(BBox[0],BBox[1])
ax.set_ylim(BBox[2],BBox[3])
ax.imshow(mapdata, zorder=0, extent = BBox, aspect= 'equal');
ax.legend()
In addition, a higher probability of WNV presence was correlated with higher precipitation over spans of at least 6 days.
We therefore recommend spraying the region marked in red above after 6 days of high precipitation.
log()
concl[concl['WnvPresent']>=0.5]
log()
#Min and Max Latitude
print(concl[concl['WnvPresent']>=0.5]['Latitude'].min(),concl[concl['WnvPresent']>=0.5]['Latitude'].max())
log()
#Min and Max Longitude
print(concl[concl['WnvPresent']>=0.5]['Longitude'].min(),concl[concl['WnvPresent']>=0.5]['Longitude'].max())
log()
concl[concl['WnvPresent'] >=0.5].sort_values(by="Latitude")
log()
concl[concl['WnvPresent'] >=0.5].sort_values(by="Longitude")
Assuming that the area is enclosed by the four points (42.011601, -87.811506), (41.992478, -87.862995), (41.954690, -87.733974), and (41.953705, -87.733974), we calculate the area with an online area calculator to be approximately $19949766.9m^2$. This is approximately 4930 acres.
The estimated average cost of spraying is \$350 per half acre, so the expected cost, rounded up, is about \$3.5 million a year.
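A quick arithmetic check of this estimate, using the figures quoted above:

```python
# rough annual spray-cost check: 4930 acres at $350 per half acre
acres = 4930
cost_per_half_acre = 350
annual_cost = acres * 2 * cost_per_half_acre
print(f'${annual_cost:,}')  # just under the $3.5M rounded figure
```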
Initial medical costs for WNV range from \$4,617 to \$25,117 per patient, with a further \$2,271 to \$22,628 for long-term care. In 2018, 176 WNV patients were reported. In the worst case, initial care alone would cost about \$4.4 million, followed by an additional \$4 million for long-term care. This clearly outweighs the cost of prevention, which would otherwise present an economic, societal, and infrastructure burden. It is therefore in the city's best interest to apply preventative measures against mosquito breeding.
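A quick check of the worst-case arithmetic, using the 2018 figures quoted above:

```python
# worst-case 2018 medical-cost estimate: 176 patients at the upper bounds
patients = 176
initial_care = patients * 25117    # upper-bound initial cost per patient
long_term_care = patients * 22628  # upper-bound long-term cost per patient
print(initial_care, long_term_care)  # roughly $4.4M and $4.0M
```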